1 Main
The task is as follows (Uber trip data):

- Load the data [uber-data.csv]
- Discover and comment on clusters of Uber data based on location (longitude & latitude)
- Analyze the cluster centers by time
- Analyze the cluster centers by date
- Remember to choose the right algorithm, and compute the optimal number of clusters and quality measures
- Develop adequate plots
- Apply the dataset for forecasting
1.1 Load Libraries
# install.packages("tidyverse")
# install.packages("lubridate")
# install.packages("cluster")
# install.packages("factoextra")
# install.packages("forecast")
# install.packages("leaflet") # was not installed yet on this machine; commented out after first install
library(tidyverse) # For data loading, manipulation, and plotting
library(lubridate) # For easier date-time parsing
library(cluster) # For k-Means and silhouette analysis
library(factoextra) # For cluster visualization
library(forecast) # For time-series forecasting
library(leaflet) # For interactive maps of New York
library(dbscan) # For density-based clustering (DBSCAN)
# Set system language as English
Sys.setlocale("LC_ALL", "en_US.UTF-8")
## [1] "LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C"
Sys.setenv(LANGUAGE = 'en')
# Set seed for reproducibility
set.seed(123)
1.2 Load data
# Load the data
uber_data_raw <- read_csv("uber-data.csv")
# Display the first few rows
print("Raw data:")
## [1] "Raw data:"
head(uber_data_raw)
## # A tibble: 6 × 4
## `Date/Time` Lat Lon Base
## <chr> <dbl> <dbl> <chr>
## 1 9/1/2014 0:01:00 40.2 -74.0 B02512
## 2 9/1/2014 0:01:00 40.8 -74.0 B02512
## 3 9/1/2014 0:03:00 40.8 -74.0 B02512
## 4 9/1/2014 0:06:00 40.7 -74.0 B02512
## 5 9/1/2014 0:11:00 40.8 -73.9 B02512
## 6 9/1/2014 0:12:00 40.7 -74.0 B02512
# Display the structure
print("Structure of raw data:")
## [1] "Structure of raw data:"
glimpse(uber_data_raw)
## Rows: 1,028,136
## Columns: 4
## $ `Date/Time` <chr> "9/1/2014 0:01:00", "9/1/2014 0:01:00", "9/1/2014 0:03:00"…
## $ Lat <dbl> 40.2201, 40.7500, 40.7559, 40.7450, 40.8145, 40.6735, 40.7…
## $ Lon <dbl> -74.0021, -74.0027, -73.9864, -73.9889, -73.9444, -73.9918…
## $ Base <chr> "B02512", "B02512", "B02512", "B02512", "B02512", "B02512"…
Column names: “Date/Time” (stored as character, so it needs parsing), “Lat” (latitude), “Lon” (longitude), and “Base” (meaning not yet clear). There are over 1 million rows.
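Since “Date/Time” is a character column in month/day/year order, it can be parsed with `lubridate::mdy_hms()` before any time-based analysis. A minimal sketch on a sample timestamp; the commented pipeline and the `uber_data` name are illustrative, not part of the original code:

```r
library(lubridate)

# Parse one timestamp in the same format as the Date/Time column
ts <- mdy_hms("9/1/2014 0:01:00")
month(ts)  # 9
hour(ts)   # 0

# Applied to the full data set (sketch):
# uber_data <- uber_data_raw %>%
#   mutate(DateTime = mdy_hms(`Date/Time`),
#          Hour = hour(DateTime),
#          Date = as_date(DateTime))
```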
1.3 Summary statistics
summary(uber_data_raw)
## Date/Time Lat Lon Base
## Length:1028136 Min. :39.99 Min. :-74.77 Length:1028136
## Class :character 1st Qu.:40.72 1st Qu.:-74.00 Class :character
## Mode :character Median :40.74 Median :-73.98 Mode :character
## Mean :40.74 Mean :-73.97
## 3rd Qu.:40.76 3rd Qu.:-73.96
## Max. :41.35 Max. :-72.72
Nothing extraordinary here: the coordinate ranges (Lat roughly 40.0–41.35, Lon roughly -74.77 to -72.72) are consistent with the greater New York City area, though the extremes suggest a few pickups far outside the core city.
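Those extreme coordinates may distort location clusters, so it can help to restrict the data to a plausible NYC bounding box before clustering. A sketch on a toy tibble; the box limits are an assumption, not part of the task:

```r
library(dplyr)

# Hypothetical bounding box around the NYC area (assumed limits)
nyc_box <- list(lat = c(40.5, 41.0), lon = c(-74.3, -73.6))

# Toy stand-in with one far-away point (eastern Long Island)
pts <- tibble::tibble(Lat = c(40.75, 41.35), Lon = c(-73.99, -72.72))
pts_nyc <- pts %>%
  filter(between(Lat, nyc_box$lat[1], nyc_box$lat[2]),
         between(Lon, nyc_box$lon[1], nyc_box$lon[2]))
nrow(pts_nyc)  # 1
```

On the real data the same filter would be applied to `uber_data_raw`.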
Base ‘B02764’ handled the most trips, while ‘B02512’ handled the fewest. The Base column likely refers to the TLC base-company license code of the dispatching base, though this is not confirmed in the data itself.
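The per-base trip totals behind that observation can be obtained with a simple `count()`. A sketch on a toy tibble standing in for `uber_data_raw`:

```r
library(dplyr)

# Toy stand-in for uber_data_raw (real code would use uber_data_raw instead)
trips <- tibble::tibble(Base = c("B02764", "B02764", "B02764", "B02512", "B02598"))

base_counts <- trips %>%
  count(Base, sort = TRUE)
base_counts
# First row is the busiest base: B02764 with n = 3
```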